feat: automatic retry and failover for rate-limited LLM requests#733
raheelshahzad wants to merge 2 commits into katanemo:main
Conversation
Thanks a lot for putting this change together @raheelshahzad. Please join our Discord channel too. Overall looks good!

I left some comments in the PR, and have some additional suggestions/comments on the overall change:
- we should do exponential backoff on retries
- how do we ensure that we have not exceeded the request timeout?
- `max_retries` should be defined somewhere in config.yaml; probably not in this PR, but we should let developers define that variable
- this code change needs an update to the docs
- I think we should allow retry to the same provider, or at least let developers define whether they want to retry to a different provider. Consider the following example:
```yaml
model_providers:
  - model: openai/gpt-4o
    base_url: https://dsna-oai.openai.azure.com
    access_key: $OPENAI_API_KEY
    retry_on_ratelimit: true       # new feature
    retry_to_same_provider: true   # this flag should only allow retry to the same provider; otherwise we should retry randomly across all models
  - model: openai/gpt-5
    base_url: https://dsna-oai.openai.azure.com
    access_key: $OPENAI_API_KEY
```
crates/common/src/llm_providers.rs
Outdated
```rust
self.providers.iter().find_map(|(key, provider)| {
    if provider.internal != Some(true)
        && provider.name != current_name
        && key == &provider.name
    {
        Some(Arc::clone(provider))
    } else {
        None
    }
})
```
should pick random model
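One way to honor this: collect every eligible alternative first, then pick one at random instead of taking the first match. A minimal sketch below uses a hypothetical `Provider` struct and a clock-derived index as a dependency-free stand-in for a real RNG (the actual code would likely use the `rand` crate):

```rust
use std::sync::Arc;
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical, pared-down provider record for illustration.
#[derive(Debug)]
struct Provider {
    name: String,
    internal: Option<bool>,
}

/// Collect all eligible alternatives, then pick one pseudo-randomly.
/// The subsec-nanos index is a stand-in for a proper RNG such as
/// `rand::seq::SliceRandom::choose`.
fn pick_alternative(providers: &[Arc<Provider>], current_name: &str) -> Option<Arc<Provider>> {
    let candidates: Vec<&Arc<Provider>> = providers
        .iter()
        .filter(|p| p.internal != Some(true) && p.name != current_name)
        .collect();
    if candidates.is_empty() {
        return None;
    }
    let idx = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .subsec_nanos() as usize
        % candidates.len();
    Some(Arc::clone(candidates[idx]))
}

fn main() {
    let providers = vec![
        Arc::new(Provider { name: "openai/gpt-4o".into(), internal: None }),
        Arc::new(Provider { name: "openai/gpt-5".into(), internal: None }),
    ];
    // The current model is excluded, so the only candidate is gpt-5.
    let alt = pick_alternative(&providers, "openai/gpt-4o").unwrap();
    assert_ne!(alt.name, "openai/gpt-4o");
    println!("failover to {}", alt.name);
}
```

Collecting candidates before choosing keeps the selection uniform across all eligible providers, rather than biased toward map iteration order.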
```rust
if res.status() == StatusCode::TOO_MANY_REQUESTS && attempts < max_attempts {
    let providers = llm_providers.read().await;
    if let Some(provider) = providers.get(&current_resolved_model) {
        if provider.retry_on_ratelimit == Some(true) {
            if let Some(alt_provider) = providers.get_alternative(&current_resolved_model) {
                info!(
                    request_id = %request_id,
                    current_model = %current_resolved_model,
                    alt_model = %alt_provider.name,
                    "429 received, retrying with alternative model"
                );
                current_resolved_model = alt_provider.name.clone();
                continue;
            }
        }
    }
}
```
we need to add exponential backoff
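A minimal sketch of what that could look like, using the 25ms base / 250ms cap discussed later in this thread (both values are assumptions here, not part of the current diff). Jitter is omitted for brevity; a real implementation would add it on top:

```rust
use std::time::Duration;

/// Exponential backoff: base * 2^attempt, capped at `max`.
/// A real implementation would also add jitter to avoid
/// synchronized retry storms across concurrent requests.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Clamp the shift so the multiplier cannot overflow u32.
    let delay = base.saturating_mul(1u32 << attempt.min(16));
    delay.min(max)
}

fn main() {
    let base = Duration::from_millis(25);
    let max = Duration::from_millis(250);
    for attempt in 0..5 {
        // 25ms, 50ms, 100ms, 200ms, then capped at 250ms
        println!("attempt {attempt}: sleep {:?}", backoff_delay(attempt, base, max));
    }
}
```

The cap matters as much as the growth: without it, a few retries against a slow provider can blow past the overall request timeout raised above.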
```rust
let mut current_resolved_model = resolved_model.clone();
let mut current_client_request = client_request;
let mut attempts = 0;
let max_attempts = 2; // Original + 1 retry
```
this should be configurable
```rust
);
// Capture start time right before sending request to upstream
let request_start_time = std::time::Instant::now();
let _request_start_system_time = std::time::SystemTime::now();
```
I looked through the Envoy retry semantics (https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route_components.proto#envoy-v3-api-field-config-route-v3-routeaction-retry-policy) and I think we should lean toward this design for retries. We don't have to implement it completely, but we should implement a bare minimum following similar semantics/config. Thoughts?
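For reference, the Envoy route-level retry policy the comment points at looks roughly like this (field names are from Envoy's `RetryPolicy` proto; the concrete values here are illustrative, not from this PR):

```yaml
retry_policy:
  retry_on: "retriable-status-codes"   # which failures trigger a retry
  retriable_status_codes: [429, 503]
  num_retries: 2
  per_try_timeout: 0.5s                # bounds each attempt, not the whole request
  retry_back_off:
    base_interval: 0.025s
    max_interval: 0.25s                # Envoy defaults max to 10x base if unset
```

Mirroring these names (`num_retries`, `retry_on`, `base_interval`, `max_interval`) would give developers coming from Envoy a familiar vocabulary.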
raheelshahzad force-pushed: d1aa3ac to ca903d2
raheelshahzad left a comment
- Exponential backoff with configurable base and max intervals.
- Configurable `max_retries`.
- `retry_to_same_provider` option.
- Random alternative selection when failing over to a different model.
- Documentation updates in the reference configuration.
- Comprehensive unit tests for all of the above.
Thanks a lot Raheel for continuing to make plano better. We are getting there. This may be a slightly better way to specify retries:

```yaml
model_providers:
  - model: openai/gpt-4o
    access_key: $OPENAI_API_KEY
    default: true
    retry_policy:
      num_retries: 2
      # retry_on: [429]             # default
      # back_off:
      #   base_interval: 25ms       # default
      #   max_interval: 250ms      # default (10x base)
      # failover:
      #   strategy: same_provider   # default

  # Need more control
  - model: anthropic/claude-sonnet-4-0
    access_key: $ANTHROPIC_API_KEY
    retry_policy:
      num_retries: 3
      failover:
        strategy: any

  # Full control
  - model: openai/gpt-4o-mini
    access_key: $OPENAI_API_KEY
    retry_policy:
      num_retries: 2
      retry_on: [429, 503]
      back_off:
        base_interval: 100ms
        max_interval: 2000ms
      failover:
        providers:
          - anthropic/claude-sonnet-4-0

  # No retries (default; just omit retry_policy)
  - model: mistral/ministral-3b-latest
    access_key: $MISTRAL_API_KEY
```
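A rough sketch of Rust types this schema could deserialize into. All names and defaults are taken from the proposal itself (25ms base, 250ms cap, `retry_on: [429]`, `same_provider` strategy); the types are hypothetical, and the real code would derive `serde::Deserialize` with duration parsing:

```rust
use std::time::Duration;

// Hypothetical config types mirroring the proposed `retry_policy` YAML.
#[derive(Debug, Clone)]
enum FailoverStrategy {
    SameProvider,          // retry the same provider only (default)
    Any,                   // pick any alternative at random
    Providers(Vec<String>), // explicit ordered fallback list
}

#[derive(Debug, Clone)]
struct BackOff {
    base_interval: Duration,
    max_interval: Duration,
}

impl Default for BackOff {
    fn default() -> Self {
        BackOff {
            base_interval: Duration::from_millis(25),
            max_interval: Duration::from_millis(250), // 10x base
        }
    }
}

#[derive(Debug, Clone)]
struct RetryPolicy {
    num_retries: u32,
    retry_on: Vec<u16>,
    back_off: BackOff,
    failover: FailoverStrategy,
}

impl Default for RetryPolicy {
    fn default() -> Self {
        RetryPolicy {
            num_retries: 0, // omitting retry_policy means no retries
            retry_on: vec![429],
            back_off: BackOff::default(),
            failover: FailoverStrategy::SameProvider,
        }
    }
}

fn main() {
    // "Need more control" example: 3 retries, failover to any provider.
    let policy = RetryPolicy {
        num_retries: 3,
        failover: FailoverStrategy::Any,
        ..Default::default()
    };
    assert_eq!(policy.retry_on, vec![429]);
    println!("{policy:?}");
}
```

Making every field past `num_retries` default-able keeps the simple case (the gpt-4o block above) a two-line addition.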
I like this developer experience and would love to see an updated PR for it. This would help with free-tier GPU traffic shaping and would be a very useful feature for coding agents.
raheelshahzad force-pushed: ca903d2 to 1384982

raheelshahzad force-pushed: 1384982 to d569d4f
Implement a retry-on-ratelimit system for the Plano gateway that automatically retries failed LLM requests (429, 503, timeouts) across alternative providers with intelligent provider selection.

Core modules (crates/common/src/retry/):
- orchestrator: retry loop with budget tracking and attempt management
- provider_selector: weighted selection excluding blocked providers
- error_detector: classifies responses into retryable error categories
- backoff: exponential backoff with jitter and Retry-After support
- retry_after_state: per-provider rate-limit cooldown tracking
- latency_block_state: high-latency provider temporary exclusion
- latency_trigger: consecutive slow-response counter
- validation: configuration validation with cross-field checks
- error_response: structured error responses when retries exhausted

Three phases: P0 (core retry + backoff), P1 (Retry-After + fallback models + timeout), P2 (proactive high-latency failover). Tests follow in a separate PR.
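The Retry-After support the backoff module mentions can be sketched as follows. This handles only the delta-seconds form of the header; the function name, the cap parameter, and the omission of the HTTP-date form are all simplifications, not the PR's actual implementation:

```rust
use std::time::Duration;

/// Parse the delta-seconds form of a `Retry-After` header value
/// (e.g. "120"), capping the result so a hostile or buggy upstream
/// cannot stall the gateway indefinitely. The HTTP-date form
/// ("Wed, 21 Oct 2015 07:28:00 GMT") is not handled in this sketch.
fn parse_retry_after(value: &str, cap: Duration) -> Option<Duration> {
    value
        .trim()
        .parse::<u64>()
        .ok()
        .map(|secs| Duration::from_secs(secs).min(cap))
}

fn main() {
    let cap = Duration::from_secs(30);
    assert_eq!(parse_retry_after("2", cap), Some(Duration::from_secs(2)));
    assert_eq!(parse_retry_after("120", cap), Some(cap)); // capped
    assert_eq!(parse_retry_after("not-a-number", cap), None);
    println!("ok");
}
```

When the header parses, its value would override the computed exponential delay for that provider; when it does not, the orchestrator falls back to plain backoff.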
…elimit

Add 302 property-based unit tests (proptest, 100+ iterations each) and 13 integration test scenarios covering all retry behaviors.

Unit tests cover:
- Configuration round-trip parsing, defaults, and validation
- Status code range expansion and error classification
- Exponential backoff formula, bounds, and scope filtering
- Provider selection strategy correctness and fallback ordering
- Retry-After state scope behavior and max expiration updates
- Cooldown exclusion invariants and initial selection cooldown
- Bounded retry (max_attempts + budget enforcement)
- Request preservation across retries
- Latency trigger sliding window and block state management
- Timeout vs high-latency precedence
- Error response detail completeness

Integration tests (tests/e2e/):
- IT-1 through IT-13 covering 429/503 retry, exhaustion, backoff, fallback priority, Retry-After honoring, timeout retry, high-latency failover, streaming preservation, and body preservation
raheelshahzad force-pushed: d569d4f to 98bf024
Summary
Adds a retry-on-ratelimit system to the Plano gateway that automatically retries failed LLM requests (429, 503, timeouts) across alternative providers with intelligent selection.
Structure (2 commits)
Commit 1 — Production code (~4k lines)
Core retry engine in crates/common/src/retry/:
- orchestrator: retry loop with budget tracking
- provider_selector: weighted selection excluding blocked providers
- error_detector: classifies responses into retryable categories
- backoff: exponential backoff with jitter + Retry-After support
- retry_after_state: per-provider rate-limit cooldown tracking
- latency_block_state: high-latency provider temporary exclusion
- latency_trigger: consecutive slow-response counter
- validation: config validation with cross-field checks
- error_response: structured error responses when retries exhausted

Three phases: P0 (core retry + backoff), P1 (Retry-After + fallback models + timeout), P2 (proactive high-latency failover).
Commit 2 — Tests (~10.9k lines)
- 302 property-based unit tests (proptest, 100+ iterations each)